An automated discovery tool for discovering hidden biological and technical links
knowYourCG is a tool for evaluating CpG feature enrichment using Illumina probe IDs. Tthis tool automates the hypothesis testing by asking whether a set of CpGs (indexed by Illumina methylation chip IDs, hence a sparse representation of the methylome) is enriched in certain categories or features. These categories or features can be categorical (e.g., CpGs located at specific tissue-specific transcription factors) or continuous (e.g., the local CpG density of CpGs). Additionally, the set of CpGs to which the test will be applied can be categorical or continuous as well.
The set of CpGs that will be tested for enrichment is called the query set, and the set of CpGs that will be used to determine enrichment is called the database set. A query set, for example, might be the results of a differential methylation analysis or from an epigenome-wide association study. We have taken the time to curate our own database sets from a variety of sources that describe different categorical and continuous features such as technical characterization of the probes, CpGs associated with certain chromatin states, gene association, transcription factor binding sites, CpG density, etc.
Additionally, knowYourCG has support for feature selection and feature engineering, which is currently in development.
The following commands prepares the use of KnowYourCG:
We have organized our database sets into different groups. Each group contains one or multiple databases. Here is how to find the names of these database groups:
## [1] "KYCG.MM285.metagene.20220126"
## [2] "KYCG.MM285.gene.GENCODEvM25.20220116"
## [3] "KYCG.MM285.HM.20220116"
## [4] "KYCG.MM285.HMconsensus.20220116"
## [5] "KYCG.MM285.TFBS.20220116"
## [6] "KYCG.MM285.TFBSconsensus.20220116"
## [7] "KYCG.MM285.tissueSignature.20211211"
## [8] "KYCG.MM285.seqContextN.20210630"
## [9] "KYCG.MM285.seqContext.20210630"
## [10] "KYCG.MM285.probeType.20210630"
## [11] "KYCG.MM285.designGroup.20210210"
## [12] "KYCG.MM285.chromosome.mm10.20210630"
## [13] "KYCG.MM285.chromHMM.20210210"
The KYCG_listDBGroups() function returns a vector containing these accessions. With the accessions, one can use the KYCG_getDBs() function to get a list of databases. When this function is ran for the first time, none of the databases have been cached. Caching on the local machine is important on two fronts: firstly it limits the number of requests sent to the Bioconductor server, and secondly it limits the amount of time the user needs to wait when re-downloading database sets. For this reason, one should run sesameDataCacheAll() before loading in any database sets. This will take some time to download all of the database sets from a given release. During the analysis the database sets can be identified using these accessions. Sesame also support some degree of guessing when a unique substring is given.
For example, the following retrieves the “KYCG.MM285.designGroup.20210210” database.
## Selected the following database groups:
## 1. KYCG.MM285.designGroup.20210210
In total, 32 datasets have been loaded for this group. We will show the first two for brevity.
## List of 2
## $ Enhancer: chr [1:58759] "cg36603393_BC21" "cg36605130_BC21" "cg36624203_BC21" "cg36624767_TC21" ...
## ..- attr(*, "group")= chr "KYCG.MM285.designGroup.20210210"
## ..- attr(*, "dbname")= chr "Enhancer"
## $ Random : chr [1:27999] "cg36603791_TC21" "cg36608073_BC21" "cg36608947_BC11" "cg36609796_TC21" ...
## ..- attr(*, "group")= chr "KYCG.MM285.designGroup.20210210"
## ..- attr(*, "dbname")= chr "Random"
On subsequent runs of the KYCG_getDBs() function, loading specific database sets from the same release will be much faster.
A query set represents probes of interest. It may either be in the form of a character vector where the values correspond to probe IDs or a named numeric vector where the names correspond to probe IDs. The query and database definition is rather arbitrary. One can regard a database as a query and turn a query into a database.
Here we will demonstrate by considering CpGs that show tissue-specific methylation as the query. We are getting the B-cell-specific hypomethylation and using that for the following analysis.
library(SummarizedExperiment)
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "B_cell"]
head(query)## [1] "cg32668003_TC11" "cg45118317_TC11" "cg37563895_TC11" "cg46105105_BC11"
## [5] "cg47206675_TC21" "cg38855216_TC21"
This query set represents hypomethylated probes in Mouse B-cells from the MM285 platform. This specific query set has 168 probes.
The main work horse function for test enrichment of the query in the databases is testEnrichment (no surprise). This function calculates the extent of overlap and apply different statistics for enrichment testing. There are four testing scenarios depending on the type format of the query set and database sets. They are shown with the respective testing scenario in the table below.
| Continuous.DB | Discrete.DB | |
|---|---|---|
| Continuous Query | Correlation | GSEA |
| Discrete Query | GSEA | Fisher’s Exact Test |
The testEnrichment() will automatically perform statistical tests and report metrics about each of the the loaded database sets. Another critical set is the universe set. This is the set of all probes for a given platform. It can either be passed in as an argument called universeSet or the platform name can be passed with argument platform. If neither of these are supplied, the universe set will be implied from the probes. In all subsequent runs of testEnrichment() in this vignette, the platform will be specified.
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_brain" & df$type == "Hypo"]
results <- testEnrichment(query, "MM285.TFBS")## Selected the following database groups:
## 1. KYCG.MM285.TFBS.20220116
## Testing against 1346 database(s)...
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
results <- testEnrichment(query, "MM285.TFBS")## Selected the following database groups:
## 1. KYCG.MM285.TFBS.20220116
## Testing against 1346 database(s)...
results %>% dplyr::filter(overlap>10) %>%
dplyr::select(dbname, Target, estimate, test, FDR) %>% headThe output of each test contains at least four variables: the estimate (fold enrichment, not the test statistics), p-value, type of test, and whether meta data is included in the tested database set (hasMeta), as well as the name of the database set and the database group. By default, the estimate column is sorted.
It should be noted that the estimate (or test statistic) is test dependent and comparison between p-values should be limited to within the same type of test. For instance, the test statistics for Fisher’s exact test and GSEA are log fold change and the test statistic for Spearman’s test is simply the rank order correlation coefficient. For simplicity, we report all of the test types in one data frame.
The nQ and nD columns identify the length of the query set and the database set, respectively. Often, it’s important to examine the extent of overlap between the two sets, so that metric is reported as well in the overlap column.
We can visualize the results in a dot plot:
or a bar plot:
or a volcano plot:
and a lollipop plot:
If you have a list of cg groups and they were tested against the same set of databases, you can make a point range plot to summarize the overall trend:
## pick some big TFBS-overlapping CpG groups
cg_lists <- KYCG_getDBs("MM285.TFBS", silent=TRUE)
queries <- cg_lists[(sapply(cg_lists, length) > 40000)]
res <- lapply(queries, function(q) {
testEnrichment(q, "MM285.chromHMM", silent=TRUE)})
## note the input of the function is a list of testEnrichment outputs.
KYCG_plotPointRange(res)Here we picked some transcription factor binding sites-overlapping CpGs and tested against chromHMM states. As expected, most of these CGs are enriched at promoters and enhancers but depleted at heterchromatic regions.
Automating the enrichment test process only works when the number of database sets is small. This is important when targeting all genes as there are tens of thousands of genes on each platform. By testing only those genes that overlap with the query set, we can greatly reduce the number of tests. For this reason, the gene enrichment analysis is a special case of these enrichment tests. We can perform this analysis using the testEnrichmentGene() function.
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
results <- testEnrichmentGene(query)## Selected the following database groups:
## 1. KYCG.MM285.gene.GENCODEvM25.20220116
## Testing against 283 database(s)...
Using these sample results, we can plot a volcano plot and lollipop plot.
For example, this given query set is tissue specific hypomethylation of mouse brain. Rufy3 is shown to be significantly enriched in this set and it is known to be enriched in neurons (https://www.ncbi.nlm.nih.gov/gene/22902).
One can get all the genes associated with a probe set by
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
genes <- sesameData_getGenesByProbes(query)
genes## GRanges object with 168 ranges and 2 metadata columns:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSMUSG00000026069.15 chr1 40429570-40465415 + | Il1rl1
## ENSMUSG00000101923.1 chr1 43409584-43409963 + | Gm29041
## ENSMUSG00000026782.15 chr1 60409619-60481158 + | Abi2
## ENSMUSG00000026288.14 chr1 87620312-87720507 + | Inpp5d
## ENSMUSG00000026425.15 chr1 131285251-131527352 - | Srgap2
## ... ... ... ... . ...
## ENSMUSG00000031119.4 chrX 52053021-52165252 - | Gpc4
## ENSMUSG00000016382.15 chrX 75785654-75875182 - | Pls3
## ENSMUSG00000071680.6 chrX 142412755-142412967 + | Gm7123
## ENSMUSG00000045180.13 chrX 152609509-152769465 - | Shroom2
## ENSMUSG00000031298.15 chrX 160390690-160498070 + | Adgrg2
## gene_type
## <character>
## ENSMUSG00000026069.15 protein_coding
## ENSMUSG00000101923.1 processed_pseudogene
## ENSMUSG00000026782.15 protein_coding
## ENSMUSG00000026288.14 protein_coding
## ENSMUSG00000026425.15 protein_coding
## ... ...
## ENSMUSG00000031119.4 protein_coding
## ENSMUSG00000016382.15 protein_coding
## ENSMUSG00000071680.6 processed_pseudogene
## ENSMUSG00000045180.13 protein_coding
## ENSMUSG00000031298.15 protein_coding
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
Here we demonstrate the use of g:Profiler2 to perform Gene ontology enrichment analysis:
library(gprofiler2)
## use gene name
gostres <- gost(genes$gene_name, organism = "mmusculus")
gostres$result[order(gostres$result$p_value),]
gostplot(gostres)
## use Ensembl gene ID, note we need to remove the version suffix
gene_ids <- sapply(strsplit(names(genes),"\\."), function(x) x[1])
gostres <- gost(gene_ids, organism = "mmusculus")
gostres$result[order(gostres$result$p_value),]
gostplot(gostres)KnowYourCG builds in several metagene/meta-feature coordinates. One can test enrichment over meta genes or simply plot the mean over metagenes:
## Selected the following database groups:
## 1. KYCG.EPIC.metagene.20220126
The query may be a named continuous vector. In that case, either a gene enrichment score will be calculated (if the database is discrete) or a Spearman correlation will be calculated (if the database is continuous as well). The three other cases are shown below using biologically relevant examples.
To display this functionality, let’s load two numeric database sets individually. One is a database set for CpG density and the other is a database set corresponding to the distance of the nearest transcriptional start site (TSS) to each probe.
## Selected the following database groups:
## 1. KYCG.MM285.designGroup.20210210
res <- testEnrichment(query, "MM285.seqContextN")
res[, c("dbname", "test", "estimate", "FDR", "nQ", "nD", "overlap")]The estimate here is enrichment score. Negative enrichment score suggest enrichment with the higher values of the database and positive enrichment score represent enrichment with the smaller values. As expected, the designed TSS CpGs are significantly enriched in smaller TSS distance and higher CpG density.
Alternatively one can test the enrichment of a continuous query with discrete databases. Here we will use the methylation level from a sample as the query and test it against the chromHMM chromatin states.
beta_values <- getBetas(sesameDataGet("MM285.1.SigDF"))
res <- testEnrichment(beta_values, "MM285.chromHMM")## Selected the following database groups:
## 1. KYCG.MM285.chromHMM.20210210
## Testing against 18 database(s)...
As expected, chromatin states “Tss”, “Enh” has negative enrichment score, meaning these databases are associated with small values of the query (DNA methylation level). On the contrary, “Quies” states are associated with high methylation level.
In addition to hypothesis testing, knowYourCG also uses the curated database sets for feature engineering. We have a pre-curated summarized experiment containing a samplesheet and beta value matrix corresponding to about 467 MM285 samples with 20k probes. The samplesheet includes UIDs pertaining to the sample and several categorical/numerical features. To use this data for a linear model, we will extract the most relevant prevalent features.
We have found that it is computationally expensive to perform a linear model/generalized linear model on a feature set of individual CpGs. Additionally, interpreting the mechanism the significantly contributing CpGs is non-trivial due to their complex interactions. We hope to leverage these pre-curated database sets by using their beta value summary statistics as features instead. We will calculate the summary statistics for the betas matrix using a list of database sets. The default is to calculate the mean.
## Selected the following database groups:
## 1. KYCG.MM285.chromHMM.20210210
## Enh EnhG EnhLo EnhPois EnhPr
## [1,] 0.4747438 0.6719724 0.6432090 0.5546328 0.3883567
## [2,] 0.4629993 0.6726350 0.5992930 0.5677431 0.3881347
## [3,] 0.4857179 0.6724716 0.6359722 0.5672481 0.3889514
## [4,] 0.4985426 0.6888043 0.6279642 0.5803560 0.3987007
## [5,] 0.4720005 0.6698668 0.6420811 0.5530925 0.3851314
## [6,] 0.4978402 0.6910251 0.6072940 0.5807078 0.4133535
Just from the few database set means above, we can see that TSS are consistently hypomethylated, which is consistent with known biology.
library(wheatmap)
WHeatmap(both.cluster(stats)$mat, xticklabels=TRUE,
cmp=CMPar(stop.points=c("blue","yellow")))## R Under development (unstable) (2021-11-09 r81170)
## Platform: x86_64-apple-darwin20.6.0 (64-bit)
## Running under: macOS Big Sur 11.6.1
##
## Matrix products: default
## BLAS: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRblas.dylib
## LAPACK: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] wheatmap_0.1.0 knitr_1.37
## [3] SummarizedExperiment_1.25.3 Biobase_2.55.0
## [5] GenomicRanges_1.47.6 GenomeInfoDb_1.31.4
## [7] IRanges_2.29.1 S4Vectors_0.33.10
## [9] MatrixGenerics_1.7.0 matrixStats_0.61.0
## [11] sesame_1.13.25 sesameData_1.13.26
## [13] ExperimentHub_2.3.5 AnnotationHub_3.3.8
## [15] BiocFileCache_2.3.4 dbplyr_2.1.1
## [17] BiocGenerics_0.41.2 rmarkdown_2.11
## [19] R6_2.5.1
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 bit64_4.0.5
## [3] filelock_1.0.2 RColorBrewer_1.1-2
## [5] httr_1.4.2 tools_4.2.0
## [7] bslib_0.3.1 utf8_1.2.2
## [9] DBI_1.1.2 colorspace_2.0-2
## [11] withr_2.4.3 DNAcopy_1.69.0
## [13] tidyselect_1.1.1 preprocessCore_1.57.0
## [15] bit_4.0.4 curl_4.3.2
## [17] compiler_4.2.0 cli_3.1.1
## [19] DelayedArray_0.21.2 labeling_0.4.2
## [21] sass_0.4.0 scales_1.1.1
## [23] randomForest_4.7-1 readr_2.1.2
## [25] proxy_0.4-26 rappdirs_0.3.3
## [27] stringr_1.4.0 digest_0.6.29
## [29] XVector_0.35.0 pkgconfig_2.0.3
## [31] htmltools_0.5.2 highr_0.9
## [33] fastmap_1.1.0 rlang_1.0.1
## [35] RSQLite_2.2.9 shiny_1.7.1
## [37] farver_2.1.0 jquerylib_0.1.4
## [39] generics_0.1.2 jsonlite_1.7.3
## [41] vroom_1.5.7 BiocParallel_1.29.12
## [43] dplyr_1.0.8 RCurl_1.98-1.6
## [45] magrittr_2.0.2 GenomeInfoDbData_1.2.7
## [47] Matrix_1.4-0 Rcpp_1.0.8
## [49] munsell_0.5.0 fansi_1.0.2
## [51] lifecycle_1.0.1 stringi_1.7.6
## [53] yaml_2.2.2 zlibbioc_1.41.0
## [55] plyr_1.8.6 grid_4.2.0
## [57] blob_1.2.2 parallel_4.2.0
## [59] promises_1.2.0.1 ggrepel_0.9.1
## [61] crayon_1.4.2 lattice_0.20-45
## [63] Biostrings_2.63.1 hms_1.1.1
## [65] KEGGREST_1.35.0 pillar_1.7.0
## [67] reshape2_1.4.4 glue_1.6.1
## [69] BiocVersion_3.15.0 evaluate_0.14
## [71] BiocManager_1.30.16 png_0.1-7
## [73] vctrs_0.3.8 tzdb_0.2.0
## [75] httpuv_1.6.5 gtable_0.3.0
## [77] purrr_0.3.4 assertthat_0.2.1
## [79] cachem_1.0.6 ggplot2_3.3.5
## [81] xfun_0.29 mime_0.12
## [83] xtable_1.8-4 e1071_1.7-9
## [85] later_1.3.0 class_7.3-19
## [87] tibble_3.1.6 AnnotationDbi_1.57.1
## [89] memoise_2.0.1 ellipsis_0.3.2
## [91] interactiveDisplayBase_1.33.0